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Abstract 

Background: Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn 
graph representing all /c-mers from the reads. The memory requirements for storing all /c-mers in a lookup table can be 
demanding, even after removal of read errors, but can be alleviated by using a memory efficient data structure. 

Results: The FM-index, which is based on the Burrows-Wheeler transform, provides an efficient data structure 
providing a searchable index of all substrings from a set of strings, and is used to compactly represent full genomes 
for use in mapping reads to a genome: the memory required to store this is in the same order of magnitude as the 
strings themselves. However, reads from high throughput sequences mostly have high coverage and so contain the 
same substrings multiple times from different reads. I here present a modification of the FM-index, which I call the 
kFM-index, for indexing the set of /c-mers from the reads. For DNA sequences, this requires 5 bit of information for 
each vertex of the corresponding de Bruijn subgraph, i.e. for each different k — 1-mer, plus some additional overhead, 
typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index for walking the underlying de Bruijn graph 
and reproducing the actual /c-mers efficiently. 

Conclusions: The kFM-index could replace more memory demanding data structures for storing the de Bruijn /c-mer 
graph representation of sequence reads. A Java implementation with additional technical documentation is provided 
which demonstrates the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/). 



Background 

High throughput sequencing is generating huge amounts 
of sequence data even from single experiments. The 
raw sequence data will typically be too much to keep 
in the memory of most off-the-shelf computers, and 
with sequencing technologies progressing faster than the 
improvements in computer memory, the memory chal- 
lenge is likely to increase in the future. 

One key property of the raw sequencing data is that it 
is highly redundant. Genomes are usually sequenced at 
high coverage, which means there will frequently be at 
least 30-50 reads covering the same region of the genome, 
differing primarily by sequencing errors. Processing of 
sequencing reads for genome assembly usually involves 
two crucial steps: error correction to remove or correct 
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sequencing errors, and assembly of overlapping reads to 
produce a smaller number of assembled sequences. 

A common approach for simplifying the processing of 
the sequence data is to consider all the /c-mers of the 
reads: i.e. all the /c-substrings of the reads if we view 
them as strings. This set of /c-strings is then thought of 
as a subgraph of the de Bruijn graph of order k — I: 
i.e. one which has vertices corresponding to all k — 1- 
substrings and edges corresponding to the /c- substrings. 
Even if sequenced at high coverage, each /c-mer is thus 
represented only once, reducing the redundancy of the 
sequence data considerably. However, direct storage of all 
/c-mers in a single list will require k letters per /c-mer, i.e. 
2/c bit of information for DNA sequences, which can be 
quite memory consuming when k is large. 

Naively, one might expect that this could be greately 
improved. From each vertex in the graph, there may be 4 
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possible out-going (or in-coming) edges if the graph rep- 
resents DNA sequences: one for each of the nucleotides. 
Encoding which of these exist in the graph should require 
only 4 bit of information per vertex; if most vertices have 
only one out-edge, this might even be reduced towards 
2 bit of information per vertex by only encoding which 
of the 4 possible edges is actually found. Of course, this 
approach requires that the vertices be known, but one 
might envision that the information about the vertices 
could be reconstructed when walking the graph: when 
walking k — 1 steps, all k —1 letters of the resulting vertex 
will be known. 

A traversable representation of the de Bruijn subgraph 
is equivalent to storing a searchable index of all the k- 
substrings. By traversable, I mean that it is possible to 
efficiently walk the graph starting at any vertex, to check 
if any /c-mer or k — 1-mer is present as an edge or ver- 
tex in the graph, and preferably also to be able to retrieve 
the /c-mers and k — 1-mers represented by the graph. 
Thus, it is not only important that the data structure be 
compact, but efficient algorithms for using it are just as 
important. 

A number of data structures exist that provide more 
compact storage of the de Bruijn subgraph than naive 
/c-mer lists or maps. Conway et al. [1] were able to rep- 
resent a de Bruijn subgraph with 12 G edges in 40.8 GB, 
i.e. 28.5 bit per edge, by using a compressed array. Other 
approaches reduce memory by storing only a subset of the 
/c-mers [2-4]. 

An entirely different approach uses a Bloom filter to 
store a hashed set of /c-mers [5] using only 4 bit per /c-mer. 
This is a probabilistic data structure with a known false 
positive rate, but where false positive edges can be iden- 
tified by not being part of longer paths. However, while 
this data structure is effective for checking if a /c-mer is 
contained in the graph, it does not easily allow listing of 
all vertices or edges. An enhancement of this method, 
Minia [6], avoids critical false positives and also allows 
retrieval of all vertices, but at the cost of higher memory 
consumption. 

Another memory-efficient solution uses the FM-index 
[7], which is based on the Burrows- Wheeler transform 
[8] used to represent a suffix array [9], to store the col- 
lection of reads in a compressed form [10]. The Burrows- 
Wheeler transform was originally developed for text com- 
pression and has the property that recurrent substrings in 
the text before the transform result in single-letter repeats 
in the transformed string. The FM-index adds auxiliary 
information on top of the Burrows -Wheeler transformed 
sequence that effectively turns it into a compactly stored 
suffix array. When concatenating the reads, the cover- 
age makes the Burrows-Wheeler transformed sequence 
dominated by single-letter repeats which are highly com- 
pressible [10]. Effectively, it requires 2 bit per edge to store 



the nucleotide, which corresponds to specifying the in- 
edge (or out-edge) of a vertex, and additional memory 
to store the run-length of the nucleotide, which corre- 
sponds to the /c-mer count. At least up to 50 times cover- 
age, this data structure should be able to store one edge 
per byte if used to represent the de Bruijn subgraph of 
/c-mers. 

It should be noted that the ability of different methods to 
handle read errors varies. Some of the cited methods are 
intended to perform error correction by filtering /c-mers 
by their frequency, while other methods assume that read 
errors for the most part have been corrected or excluded 
in advance. 

I here provide a data structure with strong similar- 
ities to the FM-index, but which stores the de Bruijn 
subgraph representing the /c-mer substrings rather than 
entire sequences. It is based on the idea of storing for 
each vertex which of the possible in-coming edges are 
actually present. For each vertex it thus needs one bit 
of information per letter in the alphabet, i.e. 4 bit per 
vertex for DNA sequences, plus some additional data. 
The additional data consists of a grouping of vertices 
which requires one extra bit per vertex, plus the equiv- 
alent to the FM-index for mapping in-coming edges to 
their parent vertices. This version of the FM-index, which 
I call the kFM-index since it applies to an index of k- 
substrings, can be generated from the stored data, but 
for computational speed a subset of the index is kept 
in memory. All in all, a de Bruijn subgraph for DNA 
sequences, including the stored subset if the index, can be 
stored using 5-6 bits per vertex if memory consumption 
is critical. In the case where most vertices are of degree 
1, i.e. have one in-edge and one out-edge, the stored 
data may be compressed down to approximately half the 
size. 

Like the FM-index, the kFM-index stores only one 
strand of DNA sequences, and is suitable for walking the 
graph in one direction. For genome assembly, one does 
not know in advance which strand the read is on, and so 
normally are required to ensure that both the k-mevs of 
the reads and their reverse complements are added to the 
graph. Some data structures, e.g. most hashing strategies, 
can combine /c-mers and their reverse complements, and 
thus require roughly half the number of items. For the 
kFM-index, however, it is necessary to add both the reads 
and their reverse complements. In doing this, one may 
walk in the opposite direction by switching to the reverse 
complement, although there will be some computational 
overhead in doing so. 

The basic operations available on the kFM-index are 
similar to those of the FM-index. Each vertex is iden- 
tified by it's index position, / = 0, 1, — 1 where 
n is the number of vertices in the de Bruijn subgraph 
and the vertices are lexicographically ordered. For a given 
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string, the vertices having that string as a prefix, iden- 
tified by the interval of index positions, can be found 
efficiently: the computational time is proportional to 
the length of the string. Given a vertex, identified by 
its index position /, one can look up directly in the 
stored data which in-coming edges exist for that ver- 
tex. The index positions of the vertices from which the 
in-edges come can be computed efficiently. Thus, check- 
ing if a string exists as a path in the de Bruijn subgraph 
can be done. The reverse operation of identifying the 
string representation of a given vertex identified by index 
position / also exists, but is slower: time complexity is 
0(k\gn), 

The kFM-index can be generated directly from a 
sorted list of in-edges, which is appropriate for amounts 
of sequence data that fit into the computer memory, 
although it should also be feasible to extend this by 
sorting the in-edges on disk: the time complexity is 
0(Nk Ig (7 IgN) where N is the total length of the sequence 
data, and thus the number of items to be sorted, and 
NklgG is the amount of data being sorted. Generation of 
kFM-indexes in memory from sequentially read sequence 
data can be done by splitting the raw sequence data into 
parts, generate kFM-indexes for each part, and then per- 
form pairwise merges of these kFM-indexes. The time 
complexity of generating the kFM-index in this manner 
is essentially 0(Nkcr \g(nm)), where n is the number of 
vertices in the final de Bruijn graph (i.e. not counting iden- 
tical /c-mers), a is the alphabet size, and m is the number 
of parts the initial sequence data is partitioned into. This 
has proven to be quite time consuming: in part because 
of the time complexity of the provided merge algorithm, 
but probably also in part due to an inefficient implemen- 
tation. I expect that there is room for major improve- 
ments. In addition, these operations are all open to 
parallelisation. 

Readers familiar with the FM-index will see the similar- 
ities to it, despite the fact that the FM-index represents all 
suffixes while this new data structure only stores informa- 
tion about /:-substrings. Not only is the data structure very 
similar, but the functions and algorithms are also similar, 
or at least analogous, to those used with the FM-index. 
I therefore refer to this data structure as the kFM-index: 
an FM-index for /:-substrings. And instead of pointing 
out the similarities throughout the article, I will point out 
differences where these are noteworthy. 

A Java implementation of the data structure is provided 
as a demonstration. 

Methods 

Notation 

Let E denote an alphabet of size a = |E|, i.e. an arbi- 
trary set whose elements we refer to as letters: for DNA 



sequences, E = {A, C, G, T} and <7 =4. A string of length /, 
or an /-string, is an element oix ^ E^. Let E* = Ug^E^ 
denote the set of all strings, including the empty string 
denoted 6. We denote the length of the string by \x\, li x 
and y are strings, xy denotes the concatenated string of 
length \x\ + \y\; for sets U and V of strings, the set of 
concatenated strings is denoted U o V = {uv\u G Z/, 
ve V}. 

We write x < y to indicate that string x sorts lexico- 
graphically before y based on an ordering of the letters in 
E. In addition to the letters in E, we have two special char- 
acters $ and oo with the properties that % < a < ooiov all 
a eT^, 

If X is an /-string, we write x = xi . . .xi where 
Xi e E are the letters. For p < q, the [p^q] sub- 
string X[p^q] = Xp . . .Xq is 2i String of length q — p -\- 1: 
^{p,p-i] is the empty string. A substring X[i^p] at the start 
is referred to as a prefix^ while a substring x\p^i\ at the 
end is referred to as a suffix. The operation of trim- 
ming away the last letter is denoted x~ = = 
xi . . .Xl-i, 

For S = . . .,sm) a list of strings, i.e. 5/ e E*, let 
^U<] (denote the set of length k substrings: i.e. x e 

is contained in S^^^ if and only if there is a string s e S 
with X = S[p^p_^j^_i] for some position p. 

We denote the base 2 logarithm by Ig^ = ^og2^ which 
is convenient for quantifying information. Thus, the infor- 
mation required to specify one out of n options is Ig n 
bit. 



Problem description 

Given a set S of strings, e.g. a set of sequencing reads, we 
will construct a compact representation of S^^\ i.e. the set 
of length k substrings, suitable for quickly checking if any 
particular A"-string is present. 

The data structure is best understood in terms of the 
de Bruijn subgraph representation of S^^\ This has ver- 
tices V = S^^~^^ and edges E = S^^^ where e e E 
is an edge from to e[2,A:]- It is a subgraph of the 

de Bruijn graph of order /: — 1, i.e. with vertices E^~^ 
and edges E^. Some authors may refer to this as a word 
graphy or even just a de Bruijn graph. Since the set of 
vertices can be deduced from the edges, storing S^^^ is 
effectively the same as storing the information encoded 
in the de Bruijn subgraph. However, the graph struc- 
ture highlights the overlap between edges meeting at 
vertices. 

While some authors focus on k as the length of the 
strings represented by the edges, others focus on the 
order of the graph which is the k — 1 length of the 
strings represented by the vertices. Since our purpose is 
to represent the k-mev composition of the sequences, it is 
natural to focus on k as the k-mev length. However, the 
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implementation of the algorithms is more naturally cen- 
tered around the vertices, and so the Java implementation 
focuses on the order of the de Bruijn subgraph which is 
k-h 



The kFM-index data structure 

The data structure for storing the /c-substrings S^^^ from 
a set of strings S has similarities to the FM-index and 
the Burrows-Wheeler transformation. One similarity is 
that the data structure stores the prefixing letters, which 
represent the in-edges to vertices, and backtracks the de 
Bruijn subgraph through these in-coming edges rather 
than walking paths from beginning to end; the sequences, 
including the strings the vertices and edges represent, are 
thus reconstructed from the in-edge data when backtrack- 
ing through the graph. 

The initial de Bruijn subgraph representing the k- 
string composition S^^^ may contain any number of final 
vertices: i.e. vertices for which there are no out-going 
edges. These final vertices correspond to k — 1 -strings 
found only as suffixes of the strings in 5, and repre- 
sent a problem as they cannot be reached by backtrack- 
ing the de Bruijn subgraph. As the data structure does 
not store the k — 1-strings for each vertex, but instead 
reconstructs these strings when walking the graph, these 
final vertices cannot be thus reconstructed. The solu- 
tion is to add extra vertices and edges leading from 
these final vertices to a special final vertex from which 



we may start the reconstruction. See Figure 1 for an 
example. 

Let V^final C S^^~^^ be a set that includes all k — 1-strings 
which are final vertices in the graph with edge set S^^^: i.e. 
if V G S^^~^^ and v is not a prefix of any string in 
then V has to be in Vfinal- Ideally, in order to get the most 
compact representation of ^S^^^ we want V^final to contain 
only these strings. However, we might start off by letting 
Vfinal contain all k — 1 -suffixes of the strings in 5, knowing 
that the vertices required to be in Vfinai have to be a subset 
of these, and then later prune away superfluous edges and 
vertices. Hence, we permit Vnn^i to be bigger than strictly 
required. 

We now define the final- completed de Bruijn subgraph 
with paths added from each v e Vfinai to a special vertex, 
=$...$ which we refer to as the final vertex, having 
vertices 

V = U (V^flnal o $^-1)'^"" U {$^^-1} (1) 

and edges 

E = S^'^u(Vi,,,,o$'-'f^ (2) 

where 1/finai o $^"^ = {v$ . . . $ | v G Vfmal} denotes the 
strings from which these additional paths are constructed. 
These are strings over an extended alphabet E U{$}, where 
$ is a special character that is sorted before any of the let- 
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Figure 1 The kFM-index data and corresponding de Bruijn subgraph. Representation of the data structure for DNA 4-mers. The vertex strings, 
lexigographically sorted, are not stored, but reconstructed from the edge and group end data. The edges columns indicate in-coming edges to each 
vertex, i.e. letters that may prefix the vertex strings. The group end flag inidicates groups of vertices with the same k - 2-prefix. The previous position 
data can be generated from the edge set data and group end data and is constant within each vertex group; a subset is stored for computational 
speed. 
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ters of E. The added vertices, i.e. those containing one 
or more $ at the end, are referred to as final- completing 
vertices and are parts of paths leading to the final vertex. 
In fact, the final-completing vertices form a tree with the 
final vertex, $^~^, as the root. Note that, for the case where 
Vfinal is empty, we explicitly add the special vertex $^~^: 
this is purely a matter of convenience. 

By this extension of the de Bruijn subgraph, we have 
ensured that there is exactly one final vertex that cannot 
be reached by backtracking the graph, namely the final 
vertex $^^~^. When sorting the vertices lexicographically, 
this will always come first. Note that we do not require 
that the final vertex be reachable from the rest of the 
graph. If the original graph had Vfinal empty, this would be 
the case. 

We may identify E with a subset of E x V describing the 
set of in-coming edges to each vertex, and will by abuse of 
notation say that the pair v) G E x 1/ is an edge if the 
concatenation av g E, We denote the in-coming edges to 
V by £v C E: i.e. Ey = [a eT^ \ av e £}. 

Note that backtracking through this de Bruijn subgraph 
corresponds to reading the strings in the backwards direc- 
tion, from the end of the string towards the beginning, 
just as with the FM-index. A variant of the data structure 
which naturally reads the strings in the forward direction 
can be obtained by performing the construction on the 
reversed strings, the only effect of which is on the sorting 
of strings in the index which would then be based on the 
reversed string. 



u = U[i^k-2\ and v = V[ij,_2] are 
identical We indicate the group end 
by a flag^' which is true if Vi is the last 
vertex in its group, false otherwise. 
This requires one bit of information 
per vertex. 

More formally, these binary arrays take logical values, 
true or false^ as defined by 

def 

r](a,i) <(=^ a e Ey. (3) 

and 

fi ^ n 7^n"+iOr/ = «-l (4) 

where Vi < v^+i follows from the lexicographic sorting 
of the vertices. We could have added as a convention that 
v„ = oo^~^, in which case special handling of the final 
position would not have been required. 

The grouping of vertices with the same k — 2-prefix 
allows us to check which in-edges originate from the same 
vertices: for a^b e S, v e V, edges au and hv orig- 
inate from vertices au~ and hv~ respectively, which is 
the same vertex \i a = h and u~ = v~, which cor- 
responds to checking if u and v are in the same vertex 
group. 



Main data 

Let n = |y| be the number of vertices of the final- 
completed de Bruijn subgraph, and let vq, . . Vn-i denote 
the vertices of V in lexicographic order; in particular, vq = 
$^~^, which is the only final vertex of the final-completed 
de Bruijn subgraph. The basic information required 
to store the final-completed de Bruijn subgraph of 
5^ is: 

Edges: The set Ey. c E of edges from each 
vertex v/ is stored; i.e. the edge set E 
identified as a subset of E x 1/. This 
may be encoded as a a x « array, 
ri{a, i), with binary values: i.e. 
r](af i) G {false, true} indicates if 
avi G E. The in-edges 
Ei = {a I avi G E} to vertex v/ may be 
represented as a bit-mapped number 
on which set operations correspond to 
binary operations. 
Group end flags: We group vertices v e V with the 

same k — 2-prefix together: i.e. u and 
V are grouped together if 



Index to previous vertex position 

In addition to the main data, there is an indexing func- 
tion which may be generated from the main data. These 
are values per vertex group, i.e. constant within the groups 
of vertices with the same k — 2-prefix. Storing the entire 
indexing function as auxiliary data would take a lot of 
memory, while computing everything from scratch would 
take a lot of time. Instead, a balance between memory 
and speed is obtained by storing a sparse subset, e.g. at 
regular intervals, and recompute the values in-between 
on demand. This index corresponds to the FM-index and 
provides a map from a vertex to the origin vertex of its 
in-edges. 

For each letter a g E, let T(a) denote the number of 
vertex groups that contain an a in-edge. For / = 0, . . . , 
let c(ay i) denote the number of vertex groups prior to 
position / that contain an a in-edge: the vertex group con- 
taining Vi is not included in this sum. This makes r (a) = 
c(af n)j since the vertices all have positions / < n. 

For a e E, / = 0, . . . , /z, let 

p (a, i) = r (b) + c(a, i) (5) 

b<a 
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where the summation of b < a is for all ^ G E lexico- 
graphically prior to a, and the 1 corresponds to skipping 
the vertex vq = $^~^. Note that if is the letter following 
a in the alphabet E, i.e. = a -\- lifwe consider the let- 
ters to be enumerated 0, . . . , a — 1, then p(a^, 0) = p(a, n); 
and for a the last letter of E, i.e. a = a — 1, we have 
p(a, n) — n. 

The position array, p{a, i), has the property that if ver- 
tex Vi has an in-coming edge avi e i.e. a e Ey.y this edge 
comes from vertex Vp^aj)- A more general definition is 
that 

p(af i) — min{/ | Vj > avj or / = n} (6) 

where, as before, v~ = V[i^^_2] for k — 1 -strings v e V and 
u > V refers to the lexicographic ordering of strings. 

We may also note that, if we represent letters a by inte- 
gers 0, . . . , a — 1, we have pia, n) = p(a + 1, 0), and may 
define p(an + /) = p(ayi), where p(j) is again a non- 
decreasing function for ; = 0, . . . , n x cr with p (0) = 1 and 
p(n • a) = n. Representing p(a, i) in terms of pian + /) is 
sometimes convenient, e.g. when we want to compute the 
inverse, and similarly we may use an + / to represent an 
in-edge, or potential in-edge, {Uy vt). 

As may be noted, the role of p{ayi) in mapping 
from one vertex to another is essentially the same as 
the FM-index for mapping from one suffix to another. 
The main difference is in the grouping of vertices into 
groups, where counts are over vertex groups rather 
than individual vertices. The reason this vertex group- 
ing is required is that the de Bruijn subgraph allows 
branching, i.e. for vertices to have more than one out- 
edge; the vertices in a vertex group share the same set 
of in-edges. The traditional FM-index could be envi- 
sioned as a de Bruijn subgraph with no branching, where 
each vertex has exactly one in-edge and one out-edge, 
and so the vertex groups would all consist of just one 
vertex. 

Auxiliary data stored for computation speed 

Storing pia, i) (or equivalently c^a^ i)) as an array of inte- 
gers requires a x Ig w bit for each vertex. However, 
if p(aj) are known for some nearby y, the number of 
computational steps to compute p(a, i) from p(aj) is pro- 
portional to \i — 7|. So by storing only a subset of the 
p(aj), e.g. every qth position, the memory required for 
storing the auxiliary data is greatly reduced, but at the 
cost of computational time for determining p{a,i). The 
partial storing of p{a,i) is essentially the same as for 
the FM-index, and can be done in a number of different 
ways. 

Let 0 = Iq < - - - < = n he the position for 
which p{a,i) is to be stored, with ir — ir-i < q for 



some chosen q. The stored values then consist of an array 
K{a^ -\- r) = p(a,ir) where a = 0, ...,cr — 1 repre- 
sent the letters. Thus, we have K(j) for 0 < / < 
with increments 0 < /c(;) — /c(; — 1) < q. Storing the 
entire k array is, however, still space consuming unless q is 
allowed to be big, in which case computing p (a^ i) will take 
time. 

Knowing that /c(;) increases by values between 0 and 
qy we can write K{j) = uj + qllj where Uj = \_K{j)/q\ 
and 0 < Uj < q and store the Uj in a bit-packed 
array. The values Uj now have increments AZiy = Uj — 
Uj-i G {0, 1}. We store the increments AZ/y as an array 
of bits, and a subset of the Uj from which the remain- 
ing Uj can then be computed efficiently. This allows us to 
select a much smaller value for q than would otherwise be 
feasible. 

Java implementation 

The Java implementation stores rj {a, i) andj^ as a a + 1 bit 
block for each vertex l These blocks are then packed into 
64 bit words (long integers). For DNA sequences, each 
64 bit word thus stores 12 vertices, each using 5 bit. 

The stored values of p(aj) are expressed in terms 
of k(J) = Uj + qUj, The Uj are bit-packed into an 
array to preserve memory. For reconstruction of the Uj, 
every 64th Uj is stored, i.e. Zi64r) and the increments 
AZiy are stored in blocks of 64 bits. Any Uj can then 
be computed efficiently from Zi64r> r = [7/64], and the 
increments AUe^r+v • • - y^Uj using operations on 64 bit 
words. 

Fundamental functions for utilising the data structure 

For a string x and a e E, we define the function y{x) 
recursively by 

y(6) = y($) = 0, y{oo) = n, 
y(a) = p (ay 0), y (ax) = p(a,y (x)). 

This function has a natural interpretation. \i \x\ < 
/c, y (x) is the smallest non-negative integer / for which 
X < vu or i = n if none such exists. If |^| > k,y(x) 
is the smallest integer / for which x < viy for some 
string y e E* so that v/y can be realised as a path in 
the de Bruijn subgraph (V,E), i.e. all /:-substrings of v/y 
are in or / = fz if no such string exists. This property 
is formally proved in Lemma 2 in the Proofs of results 
appendix. 

In order to utilise the data structure, for strings x with 
length \x\ < k, we define two utility functions: the first 
vertex v >xis 

a(x) = y(x) = min{/ \ Vi > x or / = n} (8) 
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while the first vertex with v > xoo is 

P(x) — y(xoo) = min{/ | Vi > xoo or i = n} (9) 
Recall that oo > a for all ^ g E, so 

finds the first v € 1^ for which V[i^/] > x, while Qf(;v:) finds 
the first v for which V[i,/] > Thus, a and ^ are defined 
by the property 

{v^V\ V[i,/] = ^} = [vi I c^(^) < / < P{x)} (10) 

for all ;v G E^, / < /c, making them exactly the functions we 
need to identify all vertices starting with a given prefix. 

Note that if we add the vertex = oo^~^ to the list, 
we wouldn't have to specify the / = n case in the above 
definitions. Again, this addition would purely be a matter 
of convenience and not have any practical impact. 

Algorithms for using the data structure 

The above described data structure encodes a de Bruijn 
subgraph representation of the /c-substring composition 
of the strings 5. However, to utilise this representation, we 
need efficient algorithms. 

Throughout the algorithms, vertices of V will be identi- 
fied by their position i e {0, . . . , n — 1} in the lexicograph- 
ically sorted list vq, . . . , Vn-\ where n = \V\. The string 
that each vertex represents will generally not be known. 

The alphabet E is known from the start and the 
letters ordered. Computationally, it is natural to repre- 
sent the letters by numbers 0, . . . , d — 1 (ignoring the 
letter $) since they are to be used as array indexes. 
However, for added readability, I will denote them as 
letters a e Ti in the algorithms rather as numerical 
indexes. 

Computing the previous position p (a, i) for arbitrary 
positions 

A first step is to be able to compute p (a^ i) for arbitrary 
positions based on the stored data. 

Let 0 = io < . . . < ii^ = nhe the values for which p (a, i) 
is stored, i.e. Pstorei^i^ h) = Pi^y h)y and define functions 

= mm{ir\ir > = maxjrVI^V < i} (11) 

for pointing to the next or previous stored value. We may 
then compute p (a, i) by aggregating from the vertex group 
containing p{a,r (/)) as in Algorithm 1. 



Algorithm 1 Compute arbitrary pia^i) from previous 

stored value 

function pia^ i) ><3^ G E, / g {0, . . . , n} 

Hi = 0 then return PstoreC^^^ 0) end 
/ ^ > stored position / = iy < i 

P ^ Pstorei^J) > stored value 

while not fi-i do / ^ / — 1 end 
hasEdge ^ false 
while j < i do 

Ha ^ Ej then hashEdge ^ true end 
if fj then > end of vertex group 

if hasEdge then p ^ p-\-l end 
hasEdge ^ false 
end if 

end while 
return p 
end function 



An alternative is to start at p(a,i^(i)) and subtract 
contributions from vertex groups prior to this posi- 
tion, which can be done with a similar algorithm (see 
Additional file 1). To speed up the procedure, one may 
identify the nearest stored value, either previous or later, 
and use whichever of the two algorithms is appropriate. 
This will on average double the speed, and is done in the 
Java implementation. 

Find all vertices with a particular prefix 

The functions a(x) and fi(x) for strings x with length |»;^| < 
k are both naturally expressed in terms of the the func- 
tion y. We note that y gets called either as y(x) = y(x$) 
or as y(xoo)y and so we can express these as y(x) = 
y (x$) = y (x, 0) and y (xoo) = y (x, n) where y (ax, i) = 
p(a,y(x,i)) and for ^ = 6 the empty string y(e,i) = i. 
Algorithm 2 details the computations. 



Algorithm 2 Algorithm for computing y(x,i): then, 
a(x) = y(x) = y(XyO) and ^(x) — y(xoo) — y(x,n) 
function y (x, i) 

for p = \x\ to 1 step — 1 do 

i ^ p(XpJ) 
end for 
return / 
end function 



By combining the call to a(x) and ^(x) into one function 
(Algorithm 3), it is possible to exploit the fact that when 
the interval is empty the two computations become iden- 
tical. The return value [a(x), p(x)) represents the interval 
a(x), . . . , P(x) — l, and may be represented by a pair (a, fi), 
lfa(x) = P(x)y the interval is empty: the function could be 
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modified to abort once it is clear that the resulting inter- 
val will be empty, or at least reduce computations by half 
once it is clear that a(x) = P(x). 



Algorithm 3 Compute [a(x), fi(x)) 
function INTERVAL^ 

i ^ 0;j ^ n > start with [ij ) = [0, n) 

for p = |:Jc|downto 1 do 
/ ^ p{Xp,i);j ^ p(xpj) 
— may abort returing null ifi=j — 
end for 
return [/,/) 
end function 



When merging two kFM-indexes, [a{x), p{x)) is com- 
puted numerous times on one kFM-index with x being 
vertices (or vertex prefixes) from the other kFM-index. 
When the result maps a vertex in one to an vertex in the 
other kFM-index, the two are merged; when a vertex v is 
mapped to an empty interval in the other kFM-index, the 
position a{v) = P(v) tells the merge procedure into which 
position it should be merged. 

Backtracking through the de Bruijn subgraph 

For a string x of length \x\ = I > /c, walking the de 
Bruijn subgraph path corresponding to x is most easily 
done by starting at the end of x and backtracking the graph 
towards the start of x. Algorithm 4 provides the algorithm 
for doing this: it will start at the k — 1 -suffix X[i-f(^2,l] and 
backtrack one step at a time, exiting if the string leaves the 
graph. 



Algorithm 4 Backtrack the de Bruijn subgraph for a string 

X of length > k 

I ^ \x\ >x string of length I = \x\ > k 

V ^ X[i-j^-^2,l] > ending vertex, i.e. k — 1 -suffix 

[/,;> ^[a(x),P(x)} >INTERVAL(^) 

if / = / then exit end > v does not exist 

— / now points to the vertex X[i-k^2,l] — 

for p = \l — k-\- l|downto 1 do 

if Xp ^ Ei then exit end > no in-edge 

/ ^ pixp, i) > previous vertex 

— / now points to the vertex x\p^p-^k-2] — 

end for 



Identifying the string value of a vertex 

If we start with a vertex identified by its position /, we 
can determine the string that vertex represents. In order 



to do so, we need a function p^^^ : {0, — 1} 
E X {0, . . . , w — 1} which has the property that 

= {a,j) p(aj) = iyp(aj +!) = /+! 

(12) 

and where p^^^(O) = ($,0). The pair (ayj) can be found 
through a binary search. However, since the computation 
of p is done in a stepwise manner starting at one of the 
stored values, a binary search should be performed on 
the stored values to identify the interval that contains the 
solution, and the stepwise procedure then followed until 
the solution is found. 

The interpretation of p^^^(i) = (a J) is that a is the first 
letter of v/, while ; is the last vertex in the vertex group 
with k — 2-prefix equal to the k — 2-suffix of Vf. i.e. there is 
an edge from Vi to a vertex in the same vertex group as v/. 
However, since the vertices in the same vertex group only 
differ by the last letter, and we don't have to determine this 
letter, we do not have to determine which vertex (or ver- 
tices) in this vertex group has an edge from v/. By iterating 
this procedure k — 1 times as in Algorithm 5, we can find 
the string v^. 



Algorithm 5 Return string Vi for position / 
function Vertex(0 
X = string [/c — 1] 
for = 1 to /c — 1 do 
(aj)^p^m) 
Xp a 
end for 
return x 
end function 



Generating the kFM-index from a set of strings 

The simplest way to generate the main data, i.e. the in- 
edge list and vertex group end flags, from a set of strings, is 
to generate the set of all /:-substrings, including the final- 
completing strings with one or more more $ attached at 
the end, sort and group them by their k — 1- suffix, and 
then generate the in-edge list and group end flags directly 
from this. Once the main data, i.e. the binary arrays ri and 
fi representing the edge set E and the group end flags, have 
been generated, the auxiliary data can be generated from 
these. 

This brute force approach requires a fair amount of 
memory since all k letters need to be stored for all k- 
substrings. The list of /:-substrings thus takes up k times as 
much memory than the original strings from which they 
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are generated, and so is only feasible when the original 
string data is moderate in size. 

If the string data is large, so that not all /:-substrings of 
S^^^ can be kept in memory, the job may be split up. The 
set S of strings may be split up into smaller subsets, the 
kFM-index generated for each subset, and pairwise merg- 
ing of kFM-indexes may then be performed to combine 
the subset based indexes into a kFM-index for the whole 
set. 

Note that this procedure will generate a full set of 
final-completing vertices, i.e. those containing $ at the 
end, even when they are not required by the kFM-index. 
We may reduce the kFM-index by checking which final- 
completing vertices are actually required in order to be 
able to reach the entire graph by backtracking from the 
final vertex. However, even if we do this for the ini- 
tially generated kFM-indexes so as to ensure these con- 
tain minimal sets of final-completing vertices, when we 
merge the kFM-indexes for the string subsets, superfluous 
final-completing vertices may again occur when final- 
completing vertices required in one kFM-index are ren- 
dered superfluous by the edges of the other kFM-index. 
Hence, we may wish to prune away these final-completing 
vertices at the end, or in some of the intermediary 
merges. 

Merging two kFM-indexes 

If we have two kFM-indexes denoted A and one with 
riA elements and the other with hb elements, these can 
be merged in two steps. First, we merge the two lists 
into a list of length ha + In the process of merg- 
ing the two lists, rows representing the same vertex are 
not combined, but instead we mark the occurences where 
one vertex merged list is identical to the next one so that 
these may later be combined. In addition, vertex groups 
found in both A and B must be merged into one vertex 
group, which involves removing the group end flag from 
any vertex not at the end of the merged vertex group. 
After that, we sequentially pass through the ha + ns- 
length merged list, combining identical vertices into one 
vertex. 

Instead of generating the ha + -length list in full, 
which would require the same amount of memory as the 
two original kFM-indexes of length ha and hb, "we can 
represent the merge by a + bit array indicating 
which of the two lists go into each position. Another bit 
array is used to mark identical vertices, which only needs 
Ha bit since we only need to store which vertices in A 
are also found in B. Finally, we use an + ns bit array 
to mark vertices in the merged list that should not keep 
whatever group end flag it might have: this is required 
for vertex groups found in both A and B to ensure that 
only the final vertex in the group retains it group end 
flag. 



Algorithm 6 Merge two kFM-indices 
function Merge(A,B) 

is A ^ array + hb] (false, . . . y false) 
same ^ array [n^] (false, . . . .false) 
group ^ array + hb] (false, . . . , false) 
for / = 0 to A: - 2 do 

PreMerge(/,0, 1,0, 1) 
end for 

PreMerge(/: - 1, 0, riA, 0, hb) 
merged ^ PerformMergeQ 
— compute and store Pstorefor merged — 
return merged 
end function 



It is sufficient to find the positions in the ha + w^-list 
of the Ha vertices from the kFM-index A. We may assume 
yiA S yiB since that will require only ha lookups; the hb 
vertices from the kFM-index B will then be in the remain- 
ing positions. To do this, for all items / = 0, . . . , — 1 
in the A list, compute the string vf of that vertex. We 
then look up the position of vf in the kFM-index B by 
computing [ocB(vf)y PB(yf))' If the interval is empty, i.e. 
aB{vf) = Pb(^)> the vertex goes into position / + aB(vf) 
and so we mark this position as containing a vertex from 
A, If the interval is not empty, the vertex vf is found in 
position aB(vf) in the B list (and ^Bivf) = oiB(vf) + 1), 
and so again we mark position / + aB(vf) as contain- 
ing an A vertex, while the vertex from B takes position 
/ + aB(vf) + 1; in addition, we mark vertex / in the A list 
as having a duplicate in the B list. 

We similarly need to iterate over all vertex groups in 

A, i.e. all non-empty [o(a(u), Pa(u)) for u e E^^~^. If 
the vertex groups [aA(u), Pa(u)) and [aB(u), Pb(u)) are 
both non-empty, we know that the vertices in the interval 
[oia{u) -\-aB(u), Pa{i^) + fiB(u)) correspond to the u vertex 
group in the preliminary merged list. The last vertex, i.e. 
the one in position ^Ai^) + y^5(w) — 1> will have its group 
end flag set from either list A or B. However, there will be 
another vertex with its group end flag set from the other 
list which may be in any of the other positions in the inter- 
val, and to ensure that this is unflagged, we mark positions 
aAiu) + asiu), . . . , ^a(w) + Pb(u) - 2 for group end flag 
removal. 

Rather than process the items / = 0, . . . , «^ — 1 sequen- 
tially, which requires computing Vertex(0 for each and 
then looking these up in B, it is more efficient to recurse 
over all p-mers iorp= 1, . . . , /c — 1, each recursion adding 
all possible one letter prefixes. Thus, 1 = 0 corresponds 
to vertices, / = 1 to vertex groups, [a,y6) refers to index 
intervals corresponding to a given p-mer prefix in A or 

B. The function PreMerge performs this recursion. It is 
called from MERGE where I = k — 1 — pis the number of 
trailing $. 
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Algorithm 7 Prepare merge: recurse over A intervals 
function PreMerge(/, aA, Pa> o(b> Pb) 

if a A = Pa then exit end > empty 

if / = 0 then > vertex 

isA[aA + or^] ^ true 
if as < Pb then same[aA] ^ true end 
exit 
end if 

if / = land as < Pb then > vertex group 

group[aA + Qf^] , . . . ygroup[PA + Pb-2]^ true 
end if 

for /3^ = 0tocr — Ido > prefixing letter 

PREMERGE(/ — 1, PA(a,aA)> PA(a, Pa)^ 
PB{cL,aB),pB{cL,pB)) 

end for 
end function 



Once we have the ha -\- ns bit array indicating which 
positions in the H-w^ merge comes from A or 5, another 
Ha bit array telling which vertices in A are also found in B, 
and a third ha -\- bit array marking vertices for group 
end flag removal, we can merge the two lists sequentially 
using PerformMerge. 

Algorithm 8 Merge subroutine: perform merge based on 
merge information 
function PerformMerge 

merged ^ empty list of kFM-data 

iA> IB ^ 0 

while iA + iB < ^B do 

if not isA[iA + is] then > vertex from B 



uB 



> in-edges from B 



/new ^/^and not groupUA + ib] 



IB ^ + 1 
else if same[i a] then 

^new ^ J^i^ ^ J^i^ 



> vertex in A and B 
> combined in-edges 



/new andnot^roM/7[/A + iB + 1] 



iA + 1; iB 



iB + l 



lA 

else > vertex from A 

Enew ^ Ef > in-edges from A 

/new ^/•^andnot^roM/?[/A + iB\ 

lA^lA^^ 

end if 

push merged ^ (Enew>fnew) > add to list 

— may remove data prior to ia and is — 

end while 

return merged 
end function 



The three bit arrays used to facilitate the merge require 
3nA+2nB bit of extra data. The merge requires ha lookups 



to find vf and then [asivf), ^Biyf)) followed by the copy- 
ing of all nA-\-nB elements combining them into one when 
they represent the same vertex. In addition, the creation 
of the target list requires a temporary duplication of all 
the vertex and edge data, but this could be avoided by 
using a data structure in which the merged list is being 
created gradually as needed while memory used by A and 
B is gradually released as they are being merged. The Java 
implementation provided does this. 

Pruning away superfluous final-completing vertices 

The removal of superfluous final-completing vertices, i.e. 
vertices ending with one or more $ characters that are not 
required in order to avoid final vertices that have no out- 
edge, can be done by a few simple rules. We can perform 
these checks by a recursive approach, exploiting that the 
final-completing vertices form a tree with the final ver- 
tex, $^~^, as the root. We start at the final vertex, which is 
in position 0 of the kFM-index, and recursively backtrack 
through all in-edges at most k — 2 steps to reach all final- 
completing vertices in V, performing the tests depth-first. 
We then identify edges and vertices that can be removed. 
After all final-completing vertices have been processed 
in this manner, we condense the list by removing the 
superfluous edges and vertices from the list. 

Algorithm 9 Prune the index of unneeded final- 
completing vertices 

function PruneFinalCompletions 
Vprune ^ empty list of vertices 
^prune ^ empty list of pairs (position,edge set) 
CheckUnused(/: - 1, 0) > start at final vertex 
for (/,£new) ^ ^prune do Ei ^ E^q^ end 

V^V\Vprune 

— recompute and store pstore — 
end function 



If V G F is a final-completing vertex, it is superflu- 
ous and can be removed if it has no in-edges. By having 
no in-edges, I include cases where the in-edges from ver- 
tices marked for exclusion have been removed: this is the 
reason why the tests must be done depth-first. Since p 
depends on which in-edges exist in each vertex group, 
superfluous in-edges can be removed immediately only 
for final-completing vertices where there is another ver- 
tex in the same vertex group with an in-edge from the 
same vertex (i.e. the same in-edge prefix). In general, edges 
and vertices must be marked for removal while the final- 
completing vertices are checked, and only removed after 
the checking is finished. 

liv = u% e \^ is a final-completing vertex ending with 
a single $ and e = av is an in-edge to v for some a G 
E, the edge av can be removed if there is an /^-in-edge to 
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another vertex in the same vertex group as v: i.e. if there is 
another vertex ub e V which has an in-edge aub e E, the 
in-edge av = au% can be removed from the in-edges to v. 
The edge au% is superfluous since au can be reached by 
bacl<tracldng from ub. If all in-edges to v can be removed 
by this rule, the previous rule then allows v to be removed. 



Algorithm 10 Prune recursal from vertex / ending in $ , 

return true if unneeded 

function CheckUnused(/, /) 

£del ^ 0 

if / > 1 then > only vertex in group when / > 1 
for a e Ei do 

; ^ p(a,i) 

if CheckUnused(/ - 1,;) then 

push Vprune ^ / > remove vertex vj 
Ede\ ^ Ede\ U {a} > remove in-edge a 

end if 
end for 

if £del 7^ 0 then 

push fprune ^ (i>Ei \ £del) 

end if 

else if not fi then > other vertices in group 

while not fj do 

^del ^ ^del U Ej 

end while 

Ei ^ Ei \ Edei > can replace Ei immediately 
end if 

return {Ei \ £del = 0) > i^^ue if unused 

end function 



The pruning away of superfluous final-completing ver- 
tices is not required for the kFM-index to work, but it can 
reduce the memory required in cases where the number of 
such vertices is big. As such, depending on the number of 
final-completing vertices at any point of the processing or 
merging of kFM-indexes, one may choose to perform this 
pruning at the end or at some of the intermediary merges. 

Pre-assembly 

As a first step of sequence assembly, and a good task for 
assessing both the resulting de Bruijn subgraph and the 
efficiency of its use, uniquely determined paths of the 
graph are determined. This consists of two steps. First, 
the list of vertices are checked to identify all vertices that 
have in-degree or out-degree different from one:an algo- 
rithm is provided in the Additional file 1. The remaining 
vertices are simple, non-branching vertices that paths just 
pass through. Iterating over all branching/ending vertices. 



all possible paths passing through degree-one vertices are 
generated. When generating the sequence correspond- 
ing to a path, the sequence of the last vertex of the path 
needs to be found using Algorithm 5, while the rest is 
determined when backtracking through the in-edges. 

The result is a list of non-simple vertices, and a list of all 
uniquely determined paths between these. An estimate of 
the number of such paths can be found simply from sum- 
ming over the in-degrees of all branching vertices, but this 
may include paths consisting entirely of final-completing 
vertices. When generating the list of uniquely determined 
paths, those containing only final-completing vertices are 
excluded. 



Results 

Memory usage 

The main data kept in memory is the a x n binary array 
r](a, i) and the binary flags for marking the end of each 
vertex group. Direct storage of these in bit-packed arrays 
requires cr + 1 bit of information per node, so the total 
memory for storing the in-edge list and vertex group flags 
is 

OTem [EJ] = « X (or + 1) bit. (13) 

For efficient computation of the previous vertex posi- 
tion p{a,i), a subset p{a,ir) is permanently stored for 
positions 0 = /q < • • • < = Direct storage of 
K{i;a + r) = pia, iy), where a = . . . ,g — 1 represents 
the letters, would require \gn bit since each position 
requires Ig n bit. However, decomposing /c(;) = Uj + qllj 
where ir — ir-i < q> and storing the Ui and AUj = 
Uj — Uj-i G {0, 1}, requires only \gq-\-l bit for each value 
in k: i.e. crf (Ig^ + 1) bit where f ^ n/qii the ir are evenly 
spaced. 

In order to efficiently compute arbitrary Uj, the Java 
implementation stores U^^r^ Since each of these values 
require \%{n/q) bit of memory, the total memory require- 
ment for storing the in-edge list, vertex group flags, and 
the data used to compute the previous vertex position, is 

OTem [pstore] ^ y (ig^ + 1 + ^^^) bit. (14) 

The memory saving construction used to compress k 
could be repeated for the stored values /c^ = Ucor by 
writing this as /c^ = + 64[/^, but at an additional com- 
putational cost. However, for most practical cases, the 
term \g{n/q) /64 is already a very minor part of the mem- 
ory cost and not worth the computational overhead. By 
the time n > 2^^ becomes an issue, we have probably 
moved beyond 64 bit computers, in which case increas- 
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ing the block size of 64 to the higher word size co changes 
the memory term to \g(n/q)/a) without increasing the 
computational time. 

Under the assumption that n < 2^, where co = 64 
is the word size of a 64 bit computer, the total memory 
requirement is 



mtm [£,/, p]<n 



bit (15) 



although there may be some additional overhead depend- 
ing on how the data is bit-packed into arrays. 

For DNA, (7 = 4, which requires 5 bit of data per vertex. 
However, storing 12 vertices packed into a single 64 bit 
word leaves 4 unused bits, and so it effectively consumes ^ 
5.333 bit per vertex in the present Java implementation. 
For stored previous vertex positions, Pstore> natural step 
sizes q between stored values are q = 16, 32 and 64, which 
adds 1.5 bit, 0.875 bit and 0.5 bit of memory usage per 
vertex, respectively. 



Computational speed 

Estimates of computational speeds are based on the 
assumption that n < 2^ where co is the word size: i.e. 
CO = 64 on a 64 bit computer. This means that a num- 
ber in the range 0 to n can be read from memory in one 
operation: for arbitrarily large n, this would require at least 
(\gn)/co operations. It also means single operations can 
operate on co bits at the same time, although this is largely 
unexploited by the implementation. 

The central algorithm that influences most kFM-index 
computations is that of computing arbitrary p(aj): 
Algorithm 1, or the extension of this provided in the 
Additional file 1 and implemented in the Java program. 
A subset of the values are stored in a compressed form 
and can be retrieved in constant time. If every qth value 
is stored, the time required to reconstruct and arbitrary 
piUy i) will on average be of order 0{q), However, since we 
will in practice select a fixed q of moderate size, which is 
sufficient to keep memory costs of the auxiliary data at a 
low level, and a fixed computational time remains even as 
we let q drop towards 1, we may consider the computation 
of p{a, i) to be of constant time. 

Algorithm 2 for computing y{Xyi) for any string x 
requires \x\ calls to p, and thus has time complexity 0(|;v:|). 
Consequently, identifying the interval of vertices with 
prefix X (Algorithm 3) has time complexity 0(|;v|). Algo- 
rithm 4 for backtracking the graph from vertex / along 
edges provided by the string x is essentially the same as 
the computation of y , just with checks that the edges exist, 
and also has time complexity 0(|;v:|). The reverse compu- 
tation of finding the string representation of a given vertex 
index, provided in Algorithm 5, requires solving for p^^^(0 



using a binary search, and is thus of time complexity 
0{k\%{na)). 

Constructing a kFM-index in memory, provided the 
memory is sufficient to hold a complete list of /c-mers 
from the strings, has time complexity 0(A//clgor IgA/") 
where N = ||<S|| is the total length of the string data. 
The time is primarily required for sorting the list of in- 
edges generated from the strings, while construction of 
the kFM-index from the sorted list is linear in N. 

Merging two kFM-indexes of sizes p and q using 
Algorithm 6 has worst case time complexity 0((p-\-q)kcr), 
This is due to Algorithm 7. The factor a stems from check- 
ing sequentially for in-edges and could most likely be 
replaced by something more efficient should large alpha- 
bets be of interest. There is also room for improvement, as 
detailed in hte Additional file 1, e.g. by reducing the num- 
ber of computations once conditions like as = Pb are met. 

Construction of a kFM-index in memory by dividing the 
initial string data into m parts, generating kFM-indexes 
from each, and then merging these pairwise until a single 
kFM-index remains, has time complexity 0{Nka Ig (nm)) 
using the present algorithms. At the lowest level, m kFM- 
indexes are generated, each from sorting approximately 
N/m /<"- words, which takes 0(Nklgcr Ig (N/m)) time and 
is generally fast. During the first roughly \g(nm/N) 
rounds of pairwise merges, while the number of partitions 
is higher than the coverage, the total sizes of the kFM- 
indexes may still be ^ N, and so each round requires 
time 0(Mc(7). After that, since none of the kFM-indexes 
have more than n vertices, which is the size of the final de 
Bruijn subgraph, the time complexity drops by a factor of 
two for each new round of pairwise merges until a single 
kFM-index remains. The main part of the computations 
are the first \g(nm/N) or so rounds of pairwise merges, 
and so the final kFM-index takes 0(Nka Ig {nm/N), As N 
cannot increase without m increasing in proportion, since 
N/m in-edge /:-words must be kept in memory, but m can 
increase independently if less memory is to be used, it is 
more natural to write this 0(Nka Ig (nm)). 



Benchmarking of the Java implementation 

The Java implementation has not been optimised for 
speed: it runs on a single core, and prioritises mem- 
ory consumption and code generality and readability over 
speed. However, it can still give a fair indication of the 
computational speeds, and indicate which are the bottle- 
necks. 

The benchmarks on E, coli and simulated data were run 
on Java 6 under 64 bit Windows 7 on a standard office lap- 
top: Dell Latitude E6320 with Intel Core 17-2620 2.70 GHz 
CPU and 8 GiB RAM. For C. elegans and the soil sample, it 
was run on a server with more memory and roughly twice 
the computational speed. The amount of RAM available 
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to Java was set with the option -Xmx. Note that for spec- 
ifying computer memory, I use lEC prefixes ki-, Mi-, and 
Gi- which represent powers of 1024, while SI prefixes k-, 
M-, and G- represent powers of 1000. 

All kFM-index constructions from read data added both 
reads and their reverse complements, discarding pairing 
information. Quality filtering consisted of removing bases 
with quality score less than 30, splitting the reads into 
fragments with higher quality bases. Unless otherwise 
stated, k = 23 were used: note that the Java implemen- 
tation specifies the order k — 1, i.e. length of the vertex 
strings. The distance between stored values Pstore was 
q = 32. This should require 6.2 bit per vertex as the actual 
memory usage on the data, including 0.33 bit due to the 
4 unused bits in the pack of 12 vertices stored in a 64 bit 
Java long integer, although there would be some additional 
memory overhead from the program itself. 

Memory usage during kFM-index construction was 
largely determined by the size of partitions, i.e. the maxi- 
mal number of /c-mers processed in each partition, which 
was set to different values to assess the time required to 
generate kFM-indexes by merging smaller indexes. For 
runs on the laptop, it was set to process at most 250 M 
words in each partition, which for k < 28 would require 
2 GiB of memory with each word using 2 x 32 bit integers; 
on the server, the partition size was limited by implemen- 
tation of the buffer as a Java array with at most 2 G 32 bit 
values, allowing at most 1 G words in each partition when 
k < 28. However, the peak memory usage reported here 
includes memory used and released by Java, but not yet 
garbage collected, and may reflect available memory more 
than actual use. 

£ coli sfr. K- 12 substn MG1655 

The implementation was evaluated on E. coli str. K-12 
substr. MG1655 (SRA accession SRX000429, SRR001665, 
http://wwwncbi.nlm.nih.gov/sra/SRX000429) with 21 M 
36 nt reads after discarding pairing information. 

The quality filtered graph contained 13.4 M vertices and 
13.4 M edges. This was processed in 2 parts and then 
merged, taking 7.7 minutes, with roughly a fifth of the 
time spent on merging kFM-indexes. Without quality fil- 
tering, the graph contained 81.7 M vertices and 83.7 M 
edges. For kFM-index construction, this was divided in 5 
parts, which were then merged, taking 38 minutes, half of 
which was spent on merging the kFM-indexes together. 

Once the final kFM-indexes had been constructed and 
the temporary memory freed up, the Java program used 
18 MiB holding the quality filtered graph, and 69 MiB 
holding the unfiltered graph. At startup, before adding 
data, the Java program used 6-10 MiB of memory. 

When given access to 6 GiB of RAM, the peak usage 
was a bit over 3 GiB. However, by reducing the available 
memory, peak usage could be reduced to just over 2 GiB, 



with no substantial change in computational time. The 
difference is due to memory that has been used and 
released, but not garbage collected. The main limitation 
was Java's ability to allocate the approximately 2 GiB 
block required to collect and sort in-edges for kFM-index 
construction. 

In the quality filtered graph, 96.3% of the vertices were 
simple, i.e. had in- and out-degree one. Pre-assembly took 
12.9 seconds and produced 427 k uniquely determined 
paths. In the unfiltered graph, only 89.9% of the vertices 
were simple, and as a consequence it produced 8.3 M 
uniquely determined paths using 227 seconds. 

Simulated read data 

Simulated sequence data were generated from two 1 Mnt 
DNA sequences, which were identical random sequences 
except from 0.1% random differences, intended to simu- 
late a diploid organism with SNPs. Random 300 k 100 nt 
reads were generated with error rates 0%, 0.1%, and 1%, 
intended to represent the true sequences, reads with par- 
tial error correction, and raw reads at 30 times coverage. 

The reads with no errors resulted in a graph with 2.05 M 
vertices and 2.05 M edges. This corresponds to 2 x 1 Mnt 
plus 22 extra vertices from each strand of the 0.1% SNPs. 
Accordingly, pre-assembly produced 6358 uniquely deter- 
mined paths, which corresponds to the sequence between 
the SNPs and two variants for each SNP. 

When the reads were given 0.1% error rate, which may 
be a realistic error rate after mild error correction, the 
graph size increased to 3.26 M vertices and 3.31 M edges. 
Pre-assembly resulted in 153 k uniquely determined paths 

Including the full 1% read errors, as is common in 
uncorrected reads, the size increased to 15.7 M ver- 
tices and 16.1 M edges. Pre-assembly resulted in 1.37 M 
uniquely determined paths. 

The time for constructing the kFM-index was 47, 49, 
and 64 seconds, respectively, for the three cases when 
enough memory was allocated to process all the data 
in one part. When the reads were spilt in 6 partitions 
and then merged, this time increased to 109, 139, and 
327 seconds, respectively. The pre-assembly time was pro- 
portional to the size of the graph, and was 0.64, 3.7, and 
35 seconds, respectively. 

C elegans str. N2 

A kFM-index was generated from 67.6 M 100 nt reads on 
C. elegans str. N2 (SRA accession SRX02594, SRR065390, 
http://www.ncbi.nlm.nih.gov/sra/SRX026594). 

With k = 23, this resulted in a graph with 255 M 
vertices and 259 M edges. With reads processed in 8 
parts and merged, this took 5.4 hours on the server. Pre- 
assembly took 7.3 minutes and resulted in 12.8 M uniquely 
determined paths. 
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After completion of the kFM-index, the program hold- 
ing the index used 290 MiB, some of which is unreleased 
memory after pruning away 61 M final-completing ver- 
tices at the end. The buffer used to store each of the 
8 partitions took 8 GiB, while peak memory usage was 
12 GiB. 

Soil sample 

An additional run was made on a soil sample (SRA acces- 
sion SRX128885, SRR444039, http://www.ncbi.nlm.nih. 
gov/sra/SRX128885) with 37 M 76 nt reads. 

Quality filtering left 2.59 G 23-mers to be processed, 
including the reverse complements. The resulting graph 
consisted of 2.86 G vertices and 2.87 G edges: the increase 
relative to the number of words added is due to final- 
completing vertices which represent read suffixes. This 
took 7.1 hours to generate, processing the data in 4 par- 
titions before merging them. Most of this time was spent 
merging the kFM-indexes. 

After completion, the program only occupied 2.2 GiB of 
memory, i.e. 6.6 bit per vertex including all overhead, indi- 
cating approximately 6% memory overhead relative to the 
6.2 bit per vertex required by the data structure. During 
kFM-index construction, peak memory usage was 14 GiB. 

Pre-assembly took 1.08 hours and resulted in 85.6 M 
uniquely determined paths. These appeared to be mostly 
from single reads. 

Discussion 

Memory requirements 

The memory required to store E = S^^^ as a list of strings 
would be l^l X klga bit. If the edge set had been an 
arbitrary subset £ C Xl^, optimal storage would require 

mtm\Eci:^] = \gC;') bit 

^|£|x(/:lga-lg^j bit 

where the approximation assumes that £ is a sparse sub- 
set of E^. This is slightly better than storing £ as a list of 
strings since it takes into account that the list of edges is 
unordered and contains each edge at most once. However, 
the claim that this is minimal required memory [1], is not 
strictly true, as £ = S^^^ is not an arbitrary subset of E^: 
it is induced by the sequences in S. If most of the strings 
in S are much longer than /c, this gives us ample room for 
reducing the memory usage. 

Storing the sequences of 5, i.e. the data from which S^^^ 
is generated, requires 

OTem[5] = ||<S|| X Igor bit (17) 



where ||<S|| = Xl;c€<sl"^l total length of the strings, 

assuming we do not have to store information about the 
lengths of the strings: this is true if all strings x e S have 
the same length \x\ = /, and a good approximation if 
the average length of the strings is much greater than the 
alphabet size. 

If the strings of S are very different, in the sense that 
they do not share /c- substrings to any particular extent, it 
will be more memory efficient to store the strings of S 
directly than storing S^^\ and little can be done to reduce 
this memory requirement. This was the case with the soil 
sample data analysed. In this case, the FM-index [11], 
which is based on the Burrows-Wheeler transform, pro- 
vides a compact index for representing all substrings of S: 
the Burrows -Wheeler transform requires ||5|| x Iga bit, 
i.e. no more than the raw sequences. 

When there is substantial overlap between the strings 
in S, there are more compact representations of S^^\ For 
example, if the strings of S can be assembled into a smaller 
number of strings, 5', of which the strings of S are sub- 
strings, we could use these strings instead to represent 
S^^^^: if each /c-string in S^^^^ is found on average v times, 
this would approximately reduce the required memory 
by a factor of v. However, since assembly is a difficult 
problem, this is not a practical approach. In addition, the 
strings may not assemble well, e.g. due to read sequencing 
errors. 

One fairly direct way to represent a de Bruijn subgraph, 
without storing a complete list of all /c-mers, is to repre- 
sent edges as pointers between vertices: e.g. an out-edge 
from a vertex may be stored as a pointer to the target ver- 
tex. Storing edges as pointers as well as the letters they 
correspond to requires l^l x (Ig 1 1/ 1 + Ig (t) bit of memory, 
which can be quite demanding for large graphs. For the 
13.4 M graph representing the quality filtered E. coli reads, 
this would be 25.7 bit per vertex: more than four times as 
much as the kFM-index. 

If most vertices are simple, i.e. have only one in-edge 
and out-edge, the number of pointers required may be 
drastically reduced by combining them into uniquely 
determined paths in the graph, as is done in Velvet [12]. 
Only one pointer would then be required for each such 
path, which for the quality filtered E, coli graph would 
be 427 k pointers each requiring 18.7 bit, resulting in 
2 + 0.6 bit per vertex for storing the nucleotide and the 
pointers. By combining vertices representing reverse com- 
plements into duplex vertices, the number of vertices and 
paths is halved, but pointers in both directions must be 
maintained, making this 2 + 2 x 0.6 bit per duplex vertex. 

Even if merging non-branching vertices allows the graph 
itself to be stored more compactly, a map from arbitrary 
k — 1-mers to vertices of the graph is required, and storing 
these pointers would cost at least \ V\ x lg| 1/| bit of mem- 
ory. A regular hash map would use additional memory 
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to identify V C but that is not strictly needed 

and could be avoided by smart hashing schemes. Still, 
for the quality filtered E. coli reads, a full map for the 
13.4 M 22-mer vertices would require at least 23 bit per 
duplex vertex, and this would increase for larger graphs. 
If the 22-mers are mapped to the uniquely determined 
paths rather than to the individual vertices, this could 
be reduced to 18.7 bit per duplex vertex, which with the 
3.1 bit per duplex vertex for storing the graph, still adds 
up to twice as much as required by the kFM-index, and 
would increase with the size and complexity of the graph. 
Many genome assemblers, such as ABySS [13], hash either 
vertices or edges, and are thus subject to this requirement. 
In fact, for any data structure that does not throughout 
make use of the fact that the vertices and edges tend 
to form long paths, the memory bound of equation (16) 
applies. 

A natural method to compare the kFM-index against 
is the compressed Burrows-Wheeler transform of the 
concatenated reads used by SGA [10], due to the simi- 
larity between the kFM-index and the Burrows-Wheeler 
based FM-index. The Burrows-Wheeler transform of the 
reads would result in runs of identical bases correspond- 
ing to the coverage of the reads (unless broken up due 
to sequencing errors), and the SGA stores each run by 
its base and run length in a single byte. A naive com- 
parison of memory requirements could be made against 
the in-edge data of the kFM-index which requires 5 bit 
per vertex; if most vertices have degree one, the in-edge 
data can be stored more compactly using only 2-3 bit 
per vertex since most vertices only require the base of 
the single in-edge to be stored. However, this is an unfair 
comparison since the kFM-index only stores the /c-mers 
for a specific /c, while the Burrows-Wheeler transform 
used by SGA stores all substrings, and thus allows /c-mer 
frequencies to be found for all L While the kFM-index 
specifically stores the de Bruijn k-mev subgraph, SGA uses 
overlap-based assembly and was not made with de Bruijn 
graphs in mind. SGA also does frequency based read error 
correction. 

Memorywise, the kFM-index is comparable to the prob- 
abilistic de Bruijn graph using a Bloom filter [5], which 
requires approximately 4 bit of data per vertex. However, 
this can merge k-mevs with their reverse complements, 
and this reduces the memory requirement by a factor of 
two relative to data structures like the kFM-index which 
has to store both. On the other hand, the Bloom filter 
is probabilistic, with a risk of introducing false vertices. 
While it can be used for checking if an arbitrary vertiex is 
present in the graph, although this requires a lower error 
rate and thus a little more memory spent per vertex, addi- 
tional information is required to actually reproduce the 
graph. Certain removal of false vertices as well informa- 
tion required to reproduce the graph can be added [6], but 



requires more memory. In comparison, on the same E. coli 
read data, Minia [6] represented 4.7 M "solid" 23-mers 
(frequency at least 3) using 13.62+0.49 bit per duplex ver- 
tex, where each duplex vertex represents both a /c-mer and 
its reverse complement. The 0.49 bit were used to store 
the marking structure required to reconstruct the graph. 
As such, the amount of memory per vertex is just slightly 
above the kFM-index, given that kFM-index needs two 
vertices to represent reverse complements where Minia 
only needs one. 

As can be seen, the different data structures have dif- 
ferent strength and weaknesses. Reductions in memory 
consumption tends to come at the expense of accessibilty 
of the stored information, computational speed and sim- 
plicity. Choice of a suitable data structure for any given 
problem thus depends on the computational needs, and 
there is no single best data structure for storing /c-mer 
data from high throughput sequencing reads. The kFM- 
index is particularly made for storing the k-mev de Bruijn 
subgraph representation of sequence reads in a compact 
manner, yet allowing efficient random access of vertices 
and edges. 

Further reduction in memory usage 

The main memory usage of the kFM-index is the bit arrays 
rj (a, i) and ft representing the in-edges and group end 
flags. These contain a + 1 binary values per vertex: 5 bit 
per vertex for DNA sequence data. For arbitrary de Bruijn 
subgraphs, little can be done to reduce this substantially. 

De Bruijn subgraphs constructed from real sequence 
data will, however, tend to be dominated by vertices with 
exactly one in-edge and no other vertices in its vertex 
group. In this case, for the majority of vertices, the data 
could simply be summarised by one letter, at G S, repre- 
senting the letter of the in-edge to vertex /: i.e. Ei = {a} 
^ndfi = true. This would reduce the required mem- 
ory from a + 1 bit per vertex, and potentially towards 
lg(a) bit per vertex. The current implementation tar- 
gets cases with small alphabets, like DNA which has 
cr = 4, and little emphasis has been placed on han- 
dling large a where lg(a) bit per vertex would make a big 
difference. 

Since not all vertices can be thus represented, there 
would have to be some way of representing more general 
vertices as well, which would require both some mem- 
ory overhead as well as computational overhead. For DNA 
sequencing reads, the conditions might be met where this 
approach could be useful, and could possibly reduce the 
memory consumption for storing the kFM-index by a fac- 
tor of 2. The entropy of the vertex data, in-edge sets 
and group end flags, for typical kFM-indexes on DNA 
sequencing read data tends to be 2 to 2.5 bit per vertex, 
even for sequence data with read errors included. 
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Effects of read errors 

I have not explicitly addressed to problem of read errors. 
Each read error may result in up to — 1 additional ver- 
tices, one for each of the k — 1 -substrings containing the 
read error. If the error rate or coverage is high, this may 
result in a substantial increase in the number of vertices. 
For example, if the error rate is 6, the chance that a par- 
ticular /c-word from a sequence will contain an error is 
approximately ke. If the average coverage is y, this means 
there will tend to be yke incorrect vertices for each cor- 
rect vertex: i.e. the number of vertices in the de Bruijn 
subgraph when incorrect vertices are included will hen ^ 
(1 + yke) HcoYYect where ^correct is the number of correct 
vertices. E.g. with 1% error rate, 30 times coverage, and 
k = 35, this would make ten times as many incorrect 
vertices than correct vertices. The simulated read data 
illustrate this trend. 

By also storing vertex counts, the kFM-index could be 
used for error correction based on k — 1-mer frequen- 
cies. However, there already exist multiple algorithms and 
applications for excluding infrequent k-mers [10,14-18]. 
Furthermore, much of the advantage of the kFM-index 
over the compressed Burrows -Wheeler transform of the 
concatenated reads [10] is lost if the frequency count is 
to be stored, which is needed for error correction. Hence, 
the basic assumption has been that the primary usefulness 
of the kFM-index occurs when the majority of read errors 
have been removed or corrected in advance, and that error 
correction is more efficiently done prior to kFM-index 
construction. 

The Java implementation allows filtering by base qual- 
ity scores, removing bases with low reliabiUty and splitting 
the sequence accordingly. On a number of benchmark- 
ing runs, this simple approach seemed to be able to 
remove the majority of read errors, and reduce the size 
of the resulting de Bruijn graph substantially. However, 
for genome assembly, further error correction would be 
required. 

Effect of adding final-completing vertices 

The vertices V consist of S^^~^^ with vertices added to 
allow paths from Vjfinai to $^~^. There may be up to /: — 1 
vertices added to V for each vertex in Vfinal- However, as 
long as I Ffinall is small compared to 15^^"^^ |, this has little 
impact on the memory requirements. We may note that 
we can always get | Vfinal I < |*S|, so if the strings on average 
are much longer than k letters, the contribution of V^finai is 
likely to be small 

For final-completing vertices to contribute substantially 
to the memory usage, there would have to be a substan- 
tial portion of read ends not found internally in other 
reads. This could happen for read data consisting of short 
reads with low coverage. It could also happen if there are 



frequent read errors at the flanks of the reads which are 
not filtered out or corrected. 



Computational speed 

The central operation in most uses of the kFM-index is 
the computation of the in-coming vertex position, p(a, i). 
Algorithm 1 gives an implementation whose time com- 
plexity on average is O(^), where q is the distance between 
the stored values Pstore- 

The Java implementation combines algorithm 1 with 
a similar algorithm provided in the Additional file 1 for 
computing p (a, i) from the next stored value rather than 
the previous one, and selects the closest stored value, 
effectively halving the number of steps needed. A good 
balance between speed and memory usage is then to use 
q = ?>2. 

In the Java implementation, the average computational 
time of p {a, i) is linear in q as could be expected. In the 
benchmarking, time per call to p during large kFM-index 
was estimated to 75.8 + 2.8^ ns. Although some of this 
is likely overhead related to the merge operations, and 
the numbers will depend on the computer and implemen- 
tation, it does give an indication that reducing q much 
below 30 is of limited use. One likely reason for this is the 
time required for random memory access is high, while 
repeated accessed to the same block of memory is fast due 
to caching. 

I will treat calls to p as constant time in the subse- 
quent analyses, although the present implementation does 
depend on q, under the assumption that q will remain 
fixed, typically between 30 and 64, and that in this range 
the memory overhead of the stored values Pstore is moder- 
ate. 



Low-level parallel computing of p (a, i) using 64 bit words 

Algorithm 1 works by processing one bit at a time. When 
doing this, the average number of steps required to com- 
pute p{a,i) is proportional to the distance, q^ between 
the stored values. It is, however, possible to parallell pro- 
cess these bit operations and thus exploit that e.g. a 64 bit 
processor can process a 64 bit word in one operation. 

One such method is described in the Additional file 1, 
in which the number of steps for processing one word of 
data is proportional to the alphabet size a. This method 
requires that the flags indicating iia e Ej for consecutive 
7 are stored in one word, and the same for the group end 
flags fjy so that the data for an entire interval of / positions 
can be retrieved from memory in a single operation. A 
replacement for Algorithm 1, provided in the Additional 
file 1, will then compute arbitrary p(af i) from one word 
of containing in-edge data for the letter a and one word 
with group end flags, allowing a distance ^ = 64 — a + 1 
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between stored values: reducing the distance below this 
has no benefit. 

The present Java implementation, however, does not use 
this approach. 

Construction of the kFM-index 

The main bottleneck at present consist of constructing the 
kFM-index from the strings. The provided algorithm in 
which the strings are partitioned into subsets, each subset 
converted to kFM-indices, and these kFM-indices recur- 
sively merged together, introduces substantial overhead 
in terms of memory and in computational time during 
construction. 

In the FM-index setting, where n = N, 3. similar 
approach would have required time 0(n\nn): if we split 
the data into 2^ sets, each iteration merges these pair- 
wise using time 0(n) per round, with r rounds required. 
However, for the kFM-index, in the initial steps when the 
sequences are partitioned, the total number of vertices 
may be much greater due to k — 1 -words present multiple 
times in the reads and thus entering into a large number 
of the subset kFM-indexes. The time consumption may 
therefore increase by a factor proportional to the coverage 
and become of order 0(Nka Ig (nm)) where N = \\S\\ is 
the total size of the sequence data and m is the number of 
partitions into which it is divided. 

The algorithm for merging kFM-indexes could most 
likely be improved substantially, and the Java implemen- 
tation provided is far from optimal: neither in terms of 
speed, nor in memory usage. Some minor improvements 
have been made in the implementation, reducing the 
number of calls to p when certain conditions are met (see 
Additional file 1 for details). However, even with further 
improvements, the method of partitioning and pair wise 
merging is inherently slow. 

The construction of the kFM-index may be split up and 
run on separate CPUs; even the final mergers of two kFM- 
indexes can be split up, e.g. into different threads based 
on the first r levels of recursion. However, if the reads 
have high coverage, each of the subset kFM-indexes may 
already contain many of the same high-coverage vertices, 
and thus require almost as much memory and processing 
power as the final kFM-index. 

For constructing the Burrows-Wheeler transform and 
the FM-index, there are more efficient algorithms [19,20] 
which rely on reformulating the Burrows-Wheeler trans- 
form in terms of a shorter sequence over a larger alpha- 
bet. These do not easily generalise to the kFM-index. 
In particular, a /:-word in the original kFM-index may 
correspond to multiple different sequence locations, and 
these may appear as several different /:-words in the the 
reformulated sequences. However, I have some hope that 
the induced sorting approach [19,21] may be adapted. 



although perhaps not quite as successfully as for the FM- 
index. 

An alternative approach to kFM-index construction in 
memory is to split the reads up into edges representing 
the /c-words of the sequences, and use the disk to help 
make a sorted list. A simple way to do this can be to parti- 
tion the read data into parts that are small enough that the 
in-edge set can be stored as /c-words and sorted in mem- 
ory, write each partition to disk as a single file, and then 
merge these files to construct the final kFM-index. While 
this may require substantial temporary disk space, partic- 
ularly for high coverage sequence data, it will most likely 
be much faster than the in-memory construction of the 
kFM-index presently implemented. This approach could 
also be divided between several computers. 

Use with large alphabets 

The algorithms, and the Java implementation, have been 
written with small alphabets in mind: in particular, DNA 
with (7=4, Technically, they work for general alphabets, 
although the Java implementation cannot at present han- 
dle alphabets with a > 63 since the vertex data are packed 
into a 64 bit word. However, the routines for merging 
two kFM-indexes involve iterating over the entire alpha- 
bet, adding a time factor a to the merge procedure. For 
large alphabets, one might store the in-edge data in a more 
compact form than a bit vector, and could identify the 
relevant letters without having to check them one by one. 

Conclusions 

The kFM-index is a data structure that stores the /c-words 
corresponding to the edges of a de Bruijn subgraph in a 
compact manner, while allowing efficient random access 
to vertices and edges. The data structure is made compact 
by avoiding the direct storage of /c-words and pointers, 
which often are the main memory expense for storing 
de Bruijn subgraphs. The vertex and edge information 
is stored in a direct manner with each line in the data 
table representing a vertex, and each bit set in the in-edge 
bit array representing an edge. Thus, the compactness of 
the kFM-index data structure does not rely on compact- 
ification of the graph or compression of the data, and 
additional compression of the index is feasible. 

The presently implemented method for in-memory 
construction of the kFM-index is uncompetitive for large 
data sets. However, there are multiple ways in which 
this could be improved. Also, as with the FM-index used 
by SGA [10], after the index has been constructed, the 
user can reuse it to try different assembly options and 
parameters. 

One of the main approaches to de novo genome assem- 
bly using high throughput sequencing is to generate the de 
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Bruijn subgraph representing the k-mers of the reads, and 
multiple applications exist already for doing this. For large 
genomes and in meta-genomics, the memory required for 
representing the de Bruijn subgraph is one of the limit- 
ing factors. The kFM-index could replace existing, more 
memory demanding, data structures in existing genome 
assembly applications to allow them to process larger 
genomes or to run on off-the-shelf hardware where spe- 
cial, high-RAM computers have previously been required. 

Availability 

A Java implementation of the data structure and algo- 
rithms, together with additional technical documentation 
of the implementation, are freely available from http:// 
folk.uio.no/einarro/Projects/KFM-index/. Improvements 
to the implementation, removing memory limitations and 
introducing parallel processing, and updated benchmarks, 
are provided on this web site. 

Appendix 

Mathematical approximations 

Based on Stirling's approximation. In ^ n - In ^, we can 

approximate binomials by 

1„(„)^1„,M^<^.1„_^ (18) 

which is a good approximation as long disx ^ n. Another 
bound is 

In© <^ln^ + (n-^)ln^ <nln2 (19) 

where the first inequality follows from — 
p^n-x ^ -j^ upon entering p = x/n, where 0(^)^(1 — 
^^n-x ^ ^n/27Tx{n — x) indicating that the inequality 
is fairly tight, while the second inequality follows from 
g) = 2^ and is only tight for x/n ^ 1/2. 

Proofs of results 

Definition 1. For a de Bruijn subgraph G = (V,E) as 
defined in equations (1) and (2), we say that a string x cor- 
responds to a path in Gif\x\ > k and all k-substrings ofx 
are contained in E: i.e. x = xi . . . xi corresponds to the path 
a = X[i^iJ^k-i\ for i = 1, ...,/ — /c + 1. For a string x of any 
length, we say that x is compatible with G if either x corre- 
sponds to a path in G (for \x\ > k) or x is a substring of a 
vertex v eV (for \x\ < k). We write x ^ G to indicate that 
X is compatible with G. 

For convenience, we will include = oo^"^"^ in the 
vertex list, although it is strictly speaking not part of the 
graph, and apply the convention that oo^ corresponds to 
a path in G. The main effect of this is that the min{/ | . . .} 



expressions below take the value n if criteria for / are not 
otherwise met. 

Lemma 2. Let p(a, i) be defined as in equation (6) and y 
as in (7). For x a string m S* or E* o {$, oo}, 

y{x) = min{/ \ x < ViZ ^ G for some z G E*}. (20) 

For \x\ < k, this is just mm{i \ x < v/}, while for \x\ > k it 
requires a path starting at v/ (unless x < vj. 

Proof. For the empty string, y(e) = 0, making 
equation (20) true. For a e T^, y{a) = p(a,0), and 
equation (20) is essentially the definition of p; for y ($) =0 
the result follows as all v/ > $, while for y (oo) = n only 
Vn > OO. We will then complete the proof by induction on 
\x\ using (20) as the induction hypothesis. 

For 2 < \x\ < /c, write x = ay where a e E. By the 
induction hypothesis, we know that j < vj when j > y(y). 
Since \y\ < /c — 2, this is the same as 3/ < vf, and thus 
equivalent to x = ay < avj. If x = ay < v/, then either 
Vi = avJ for some ; in which case x < Vi = avJ and 
i ^ p(^J)> or no such vj exists in which case the first letter 
> a and / > p(a,n). The smallest / for which this 
makes x < v/ in either case corresponds to p{aj) where 

For \x\ > /c, again write x = ay where we know that 
y < VjZ ~ G when j > y(y). This means that x = ay < 
avjz = Vibz ~ G for / = p(aj) and where the condition 
7 ^ y(y) becomes / > p(aj). □ 
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